A Appendix / B Diffusion process as ODE

Neural Information Processing Systems

In this section, we show that Cold Sampling is an approximation of the Euler method for (5); the intuition is as follows. B.2 Why is Cold Sampling better than naive sampling? Cold Sampling approximates the Euler update for (5), whereas naive sampling does not have this property. The proof relies on applying the definition of a Lipschitz function twice.
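A minimal sketch of the two update rules contrasted above, assuming a degradation operator D(x, t) and a learned restoration network R(x, t); the names and explicit formulas follow common cold-diffusion conventions and are not taken from this excerpt:

```python
def naive_sampling_step(x_t, t, R, D):
    """Naive sampling: restore, then re-degrade to level t - 1."""
    x0_hat = R(x_t, t)
    return D(x0_hat, t - 1)


def cold_sampling_step(x_t, t, R, D):
    """Cold Sampling: correct x_t by a difference of degradations.

    The update x_t - D(x0_hat, t) + D(x0_hat, t - 1) behaves like an
    Euler step of the underlying ODE, so restoration errors do not
    compound the way they can with the naive rule.
    """
    x0_hat = R(x_t, t)
    return x_t - D(x0_hat, t) + D(x0_hat, t - 1)
```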


Synergizing Deconfounding and Temporal Generalization For Time-series Counterfactual Outcome Estimation

Liu, Yiling, Dong, Juncheng, Fu, Chen, Shi, Wei, Jiang, Ziyang, Hua, Zhigang, Carlson, David

arXiv.org Artificial Intelligence

Estimating counterfactual outcomes from time-series observations is crucial for effective decision-making, e.g., when to administer a life-saving treatment, yet remains significantly challenging because (i) the counterfactual trajectory is never observed and (ii) confounders evolve with time and distort estimation at every step. To address these challenges, we propose a novel framework that synergistically integrates two complementary approaches: Sub-treatment Group Alignment (SGA) and Random Temporal Masking (RTM). Instead of the coarse practice of aligning the marginal distributions of the treatments in latent space, SGA uses iterative treatment-agnostic clustering to identify fine-grained sub-treatment groups. Aligning these fine-grained groups achieves improved distributional matching, thus leading to more effective deconfounding. We theoretically demonstrate that SGA optimizes a tighter upper bound on counterfactual risk and empirically verify its deconfounding efficacy. RTM promotes temporal generalization by randomly replacing input covariates with Gaussian noise during training. This encourages the model to rely less on potentially noisy or spuriously correlated covariates at the current step and more on stable historical patterns, thereby improving its ability to generalize across time and better preserve underlying causal relationships. Our experiments demonstrate that while applying SGA and RTM individually improves counterfactual outcome estimation, their synergistic combination consistently achieves state-of-the-art performance. This success comes from their distinct yet complementary roles: RTM enhances temporal generalization and robustness across time steps, while SGA improves deconfounding at each specific time point.
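As the abstract describes it, RTM amounts to replacing randomly chosen input covariates with Gaussian noise during training. A minimal sketch of such a masking step, assuming a (batch, time, features) tensor layout and an illustrative mask_prob; neither detail comes from the paper:

```python
import torch

def random_temporal_masking(x, mask_prob=0.1):
    """Replace randomly selected time steps' covariates with Gaussian noise.

    x: tensor of shape (batch, time, features). Each (batch, time)
    position is independently overwritten with standard Gaussian noise
    with probability mask_prob, discouraging over-reliance on any
    single step's covariates.
    """
    mask = torch.rand(x.shape[:2], device=x.device) < mask_prob
    return torch.where(mask.unsqueeze(-1), torch.randn_like(x), x)
```

Applied only at training time, this leaves evaluation inputs untouched.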


Checklist

Neural Information Processing Systems

The checklist follows the references. For example: Did you include the license to the code and datasets? Please do not modify the questions and only use the provided macros for your answers.



Continuum Dropout for Neural Differential Equations

Lee, Jonghun, Oh, YongKyung, Kim, Sungil, Lim, Dong-Young

arXiv.org Machine Learning

Neural Differential Equations (NDEs) excel at modeling continuous-time dynamics, effectively handling challenges such as irregular observations, missing values, and noise. Despite their advantages, NDEs face a fundamental challenge: they cannot directly adopt dropout, a cornerstone of deep learning regularization, which leaves them susceptible to overfitting. To address this research gap, we introduce Continuum Dropout, a universally applicable regularization technique for NDEs built upon the theory of alternating renewal processes. Continuum Dropout formulates the on-off mechanism of dropout as a stochastic process that alternates between active (evolution) and inactive (paused) states in continuous time. This provides a principled approach to prevent overfitting and enhance the generalization capabilities of NDEs. Moreover, Continuum Dropout offers a structured framework to quantify predictive uncertainty via Monte Carlo sampling at test time. Through extensive experiments, we demonstrate that Continuum Dropout outperforms existing regularization methods for NDEs, achieving superior performance on various time series and image classification tasks. It also yields better-calibrated and more trustworthy probability estimates, highlighting its effectiveness for uncertainty-aware modeling.
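One way to picture the mechanism, as a rough sketch rather than the authors' implementation: sample an alternating on/off process (here with exponential holding times and made-up rates) and use it to gate the vector field of an Euler-discretized NDE, so the state evolves only while the process is active.

```python
import numpy as np

def alternating_renewal_mask(t_grid, rate_on=5.0, rate_off=5.0, rng=None):
    """0/1 indicator of an alternating on/off process over a time grid."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.ones(len(t_grid))
    t, state = t_grid[0], 1
    while t < t_grid[-1]:
        dur = rng.exponential(1.0 / (rate_on if state == 1 else rate_off))
        mask[(t_grid >= t) & (t_grid < t + dur)] = state
        t, state = t + dur, 1 - state
    return mask

def euler_nde_with_continuum_dropout(f, h0, t_grid, rng=None):
    """Euler-discretized NDE whose evolution pauses while the mask is 0."""
    mask = alternating_renewal_mask(t_grid, rng=rng)
    h = np.array(h0, dtype=float)
    for i in range(len(t_grid) - 1):
        dt = t_grid[i + 1] - t_grid[i]
        h = h + mask[i] * f(t_grid[i], h) * dt
    return h
```

Repeating the solve with independently sampled masks at test time gives the Monte Carlo spread the abstract mentions as a proxy for predictive uncertainty.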


A Appendix

Neural Information Processing Systems

The component networks are feedforward with three residual blocks. We often place restrictions on the derivations to operationalize domain-specific constraints; in practice, these constraints are implemented by fixing the corresponding terms to 0. A.2 Lower Bound Derivation (of log p). Due to memory constraints, in practice we use a batch size of 1 and simulate larger batch sizes through gradient accumulation. We observed training to be somewhat unstable on some datasets. For SCAN, all models and embeddings are 256-dimensional, and we tune over the number of layers, hidden units, and dropout rate.
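For the batch-size-of-1 detail above, gradient accumulation is the standard pattern; a generic PyTorch-style sketch, with model, loader, loss_fn, and accum_steps all placeholders rather than values from the appendix:

```python
def train_epoch(model, loader, loss_fn, optimizer, accum_steps=8):
    """Simulate an effective batch size of accum_steps using batch size 1."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
        loss.backward()                            # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```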


A Table of Notation

Neural Information Processing Systems

Table 2: A summary of the notation used in this paper (e.g., λ and w denote hyperparameters and parameters, respectively). In this section, we present the training algorithm for Self-Tuning Networks. Let A and B be square positive definite matrices. Combining with Eq. C.31, we obtain an expression for r_λ(λ); Eq. D.4 can then be represented as an expectation, as can the second term in Eq. D.15. Therefore, the first and second terms correspond to the first- and second-order Taylor approximations to the loss. In this section, we describe a structured best-response approximation for convolutional layers.
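For reference, the first- and second-order approximations referred to above take the usual Taylor form (generic notation; this is not the paper's Eq. D.15):

```latex
\mathcal{L}(w + \Delta w)\;\approx\;
\mathcal{L}(w)
\;+\;\nabla \mathcal{L}(w)^{\top}\Delta w
\;+\;\tfrac{1}{2}\,\Delta w^{\top}\,\nabla^{2}\mathcal{L}(w)\,\Delta w,
```

where keeping only the gradient term gives the first-order approximation and adding the quadratic term gives the second-order one.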


SSAL: Synergizing between Self-Training and Adversarial Learning for Domain Adaptive Object Detection (Supplementary Material)

Neural Information Processing Systems

In this supplementary material, the following sections are discussed; in particular, we include the training algorithm (Sec. ...). We see a maximum decrease of 0.8% in mAP when these thresholds are increased; although we set both thresholds at 0.5, we find that our method is relatively robust to these hyperparameters. Tab. 4 reveals that a model trained on the source domain (Sim10k) ... Calibration is measured using the ECE score: Source Only 0.25, Oracle 0.10. Detections missed by the EPM and found by our method are shown in blue.
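Since the comparison above is summarized with ECE, here is a standard binned ECE computation as a reference sketch; the 15-bin choice is a common convention and nothing here is specific to SSAL:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```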


Supplementary Material

Neural Information Processing Systems

In Section 4.1, we have shown the experimental results of HPM on two population synthetic functions. It is worth noting that the synthetic function only simulates the validation loss function. We use the same exploit strategy as in PBT, i.e., truncation selection. All the code for the synthetic functions was implemented with Autograd. As in Figure 1 of Section 4.1, we show the mean performance. We give the details of the hyperparameters tuned on the benchmark datasets as follows. Tied weights are used for the embedding and softmax layers.
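The tied-weight remark refers to sharing parameters between the input embedding and the output softmax projection; a minimal PyTorch sketch of that pattern (the vocabulary and hidden sizes are placeholders):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Language-model head whose softmax layer reuses the embedding weights."""

    def __init__(self, vocab_size=10000, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size, bias=False)
        self.decoder.weight = self.embedding.weight  # tie embedding and softmax weights

    def forward(self, tokens):
        h, _ = self.rnn(self.embedding(tokens))
        return self.decoder(h)  # logits over the vocabulary
```

Tying requires the embedding dimension to match the decoder input size, which is why both are hidden_size here.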